GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

Author: Charnois Thierry
Holat Pierre
Tomeh Nadi
Zaratiana Urchade
Publication date: 14/11/2023

Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types. In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrates strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks.

Comment: Work in progress

arXiv.org e-Print Archive

DyREx: Dynamic Query Representation for Extractive Question Answering

Author: Charnois Thierry
Holat Pierre
Khbir Niama El
Núñez Dennis
Tomeh Nadi
Zaratiana Urchade
Publication date: 26/10/2022

Extractive question answering (ExQA) is an essential task for Natural Language Processing. The dominant approach to ExQA represents the input sequence tokens (question and passage) with a pre-trained transformer, then uses two learned query vectors to compute distributions over the start and end answer span positions. These query vectors lack the context of the inputs, which can be a bottleneck for model performance. To address this problem, we propose DyREx, a generalization of the vanilla approach in which we dynamically compute query vectors given the input, using an attention mechanism through transformer layers. Empirical observations demonstrate that our approach consistently improves performance over the standard one. The code and accompanying files for running the experiments are available at https://github.com/urchade/DyReX.

Comment: Accepted at "2nd Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)" @ NeurIPS 202

arXiv.org e-Print Archive
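
The vanilla span-prediction head mentioned in the DyREx abstract above (two learned query vectors scored against every token representation) can be sketched roughly as follows. This is a minimal illustration only, not the authors' code: the helper names `span_distributions` and `best_span`, the toy 2-dimensional token vectors, and the fixed query vectors are all assumptions made for the example. DyREx itself would compute the query vectors from the input rather than keep them fixed.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def span_distributions(token_vectors, q_start, q_end):
    """Score each token against the start and end query vectors (hypothetical helper)."""
    start_probs = softmax([dot(h, q_start) for h in token_vectors])
    end_probs = softmax([dot(h, q_end) for h in token_vectors])
    return start_probs, end_probs


def best_span(start_probs, end_probs):
    """Return the (start, end) pair with start <= end that maximizes the joint probability."""
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Toy inputs (assumed): four tokens with 2-dimensional representations.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
q_start = [1.0, 0.0]  # static start query, independent of the input
q_end = [0.0, 1.0]    # static end query, independent of the input

start_probs, end_probs = span_distributions(token_vectors, q_start, q_end)
answer = best_span(start_probs, end_probs)
```

Because `q_start` and `q_end` do not depend on the question or passage, the same two vectors are applied to every input, which is the limitation the abstract points to.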

Sequence Classification Based on Delta-Free Sequential Pattern

Author: Charnois Thierry
Crémilleux Bruno
Holat Pierre
Plantevit Marc
Raïssi Chedy
Tomeh Nadi
Publication venue: HAL CCSD
Publication date: 14/12/2014

Sequential pattern mining is one of the most studied and challenging tasks in data mining. However, the extension of well-known methods from many other classical patterns to sequences is not a trivial task. In this paper we study the notion of δ-freeness for sequences. While this notion has been extensively discussed for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm devoted to the extraction of δ-free sequential patterns. Furthermore, we show the advantage of δ-free sequences and highlight their importance when building sequence classifiers, and we show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers that optimize both accuracy and earliness of predictions.

HAL - Normandie Université

INRIA a CCSD electronic archive server

HAL

HAL-Paris 13

Hal-Diderot

Fouille de motifs et modélisation statistique pour l'extraction de connaissances textuelles

Author: Holat Pierre
Publication venue: HAL CCSD
Publication date: 05/10/2018

In natural language processing, two main approaches are used: machine learning and data mining. In this context, combining pattern-based data mining methods with statistical machine learning methods is a promising but hardly explored avenue. In this thesis, we present three major contributions: the introduction of delta-free patterns, used as statistical model features; the introduction of a semantic similarity constraint for the mining, calculated using a statistical model; and the introduction of sequential labeling rules, created from the patterns and selected by a statistical model.

Thèses en Ligne

HAL-Paris 13

Pattern mining and machine learning for extracting textual information

Author: Holat Pierre
Publication date: 05/10/2018

In natural language processing, two main approaches are used: machine learning and data mining. In this context, combining pattern-based data mining methods with statistical machine learning methods is a promising but hardly explored avenue. In this thesis, we present three major contributions: the introduction of delta-free patterns, used as statistical model features; the introduction of a semantic similarity constraint for the mining, calculated using a statistical model; and the introduction of sequential labeling rules, created from the patterns and selected by a statistical model.

Theses.fr

Classification de texte enrichie à l'aide de motifs séquentiels

Author: Charnois Thierry
Holat Pierre
Tomeh Nadi
Publication venue: HAL CCSD
Publication date: 23/06/2015

English title: Sequential pattern mining for text classification

Most methods in text classification rely on contiguous sequences of words as features. Indeed, if we want to take non-contiguous (gappy) patterns into account, the number of features increases exponentially with the size of the text. Furthermore, most of these patterns will be mere noise. To overcome both issues, sequential pattern mining can be used to efficiently extract a smaller number of relevant, non-contiguous features. In this paper, we compare the use of constrained frequent pattern mining and δ-free patterns as features for text classification. We show experimentally the advantages and disadvantages of each type of pattern.

HAL-Paris 13

High Dimensional Data Stream Clustering using Topological Representation Learning

Author: Ben-Fares Maha
Grozavu Nistor
Holat Pierre
Rastin Parisa
Publication venue: IEEE
Publication date: 04/12/2022

Due to the high dimensionality of the data, storing the whole set of data during stream processing is impractical. Therefore, only a summary of the input stream is maintained, necessitating the development of specialized data structures that permit incremental summarization of the input stream. The problem becomes more complex when dealing with high-dimensional text data due to the high sparsity. In this paper we propose a new topological unsupervised learning approach for high-dimensional text data streams. The proposed method simultaneously learns the representation of the stream and clusters the data in a lower-dimensional space. The proposed OTTC (Online Topological Text Clustering) approach is evaluated and compared with state-of-the-art methods using the MOA (Massive Online Analysis) framework, an open-source benchmarking software for evolving data streams. The proposed approach outperforms the classical methods, and the obtained results are very promising for clustering high-dimensional text data streams.

INRIA a CCSD electronic archive server

HAL-Paris 13

Sélection globale de segments pour la reconnaissance d'entités nommées

Author: Charnois Thierry
El Khbir Niama
Holat Pierre
Tomeh Nadi
Zaratiana Urchade
Publication venue: 'Associacio catalana de Salut Laboral'
Publication date: 05/06/2023

Named Entity Recognition is an important task in Natural Language Processing with applications in many domains. In this paper, we describe a novel approach to named entity recognition, in which we output a set of spans (i.e., segmentations) by maximizing a global score. During training, we optimize our model by maximizing the probability of the gold segmentation. During inference, we use dynamic programming to select the best segmentation in linear time. We prove that our approach outperforms CRF and semi-CRF models for Named Entity Recognition.

HAL-Paris 13
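
The inference step described in the abstract above (dynamic programming over candidate spans to maximize a global score) can be illustrated with a generic sketch. This is not the authors' algorithm: it is a standard weighted-interval dynamic program, and the function name `select_spans`, the span-score dictionary, the `max_width` cap, and the convention that unlabeled tokens contribute a score of 0 are all assumptions made for the example. With a fixed `max_width`, the loop runs in time linear in the number of tokens.

```python
def select_spans(n_tokens, span_scores, max_width):
    """Pick non-overlapping spans with the highest total score.

    span_scores maps (start, end) with an exclusive end to a score;
    tokens left outside every chosen span add 0 to the total.
    """
    best = [0.0] * (n_tokens + 1)   # best[i]: best total score for tokens [0, i)
    back = [None] * (n_tokens + 1)  # back[i]: start of the span ending at i, if any

    for end in range(1, n_tokens + 1):
        # Option 1: leave token end-1 outside any span.
        best[end] = best[end - 1]
        back[end] = None
        # Option 2: close a span of width 1..max_width at this position.
        for width in range(1, min(max_width, end) + 1):
            start = end - width
            score = span_scores.get((start, end))
            if score is None:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Walk the back-pointers from the end to recover the chosen spans.
    spans = []
    end = n_tokens
    while end > 0:
        start = back[end]
        if start is None:
            end -= 1
        else:
            spans.append((start, end))
            end = start
    spans.reverse()
    return best[n_tokens], spans


# Toy span scores (assumed) for a 5-token sentence.
scores = {(0, 1): 0.5, (0, 2): 2.0, (1, 3): 3.0, (3, 5): 1.5, (4, 5): -1.0}
total, chosen = select_spans(5, scores, max_width=2)
```

In this toy case the program prefers the spans (0, 1), (1, 3) and (3, 5) over the overlapping candidate (0, 2), because the combined score is higher.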

Weakly-supervised Symptom Recognition for Rare Diseases in Biomedical Text

Author: Battistelli Delphine
Charnois Thierry
Holat Pierre
Jaulent Marie-Christine
Metivier Jean-Philippe
Tomeh Nadi
Publication venue: HAL CCSD
Publication date: 21/09/2016

In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have a more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.

HAL - Normandie Université

HAL-Inserm

HAL-Paris 13